Goal

Analyze the adhesin properties of the Hyr/Iff-like (Hil) family in the Sacchromycetes This is version 4 of the analysis, using the expanded blast hits on 2022-05

Adhesin predictions

The goal of the first analysis is to assess the evidence for each Hil family member as encoding yeast adhesins. We rely on a ML-based prediction algorithm and the following known sequence features of adhesins: signal peptide + GPI-anchor, tandem repeats, high S/T frequency (possibly glycosylation)

Basic information

First get the basic information about the sequences in this study.

#sps.list <- c("Cduobushaemulonis","Cpseudohaemulonis","Chaemuloni","Cauris","Clusitaniae","Dhansenii","Cparapsilosis","Lelongisporus","Ctropicalis","Cdubliniensis","Calbicans","Sstipitis","Klactis","Ncastellii","Cglabrata","Nbracarensis","Ndelphensis","Nnivariensis")
blastInfo <- read_tsv("../data/expanded-blast-homologs-info.tsv", col_types = cols())# %>% 
#  mutate(species_id = factor(species, levels = sps.list), species = NULL)

ML adhesin predictions

FungalRV is a Support Vector Machine (SVM) based prediction algorithms that use sequence features such as amino acid composition (frequency, physiochemical properties etc.) as input and train Machine Learning models to distinguish fungal adhesins from non-adhesins.

frv.th = 0.511 # recommended FungalRV score threshold
frv <- read_tsv("../output/FungalRV/fungalRV-results.txt", skip = 3, col_names = c("name","frv.score"), col_types = "cd") %>% 
  mutate(name = str_sub(name, 2), frv.pred = frv.score > frv.th)
#if("frv.score" %in% names(seqInfo))
#  seqInfo <- select(seqInfo, -frv.score, -frv.pred, -faa.score, -faa.pred)
#seqInfo <- seqInfo %>% left_join(frv) %>% left_join(faa)

SP and GPI-anchor prediction

GPI-anchored proteins are characterized by an N-terminal signal peptide, which would direct the protein to the secretary pathway, and a C-terminal GPI-anchor peptide, which would be cleaved and replaced by the GPI-anchor, allowing the protein to be tethered to the cell wall.

For signal peptide, I used the SignalP server. Its latest version is 6.0.

# Signal peptide
gff.names <- c("name", "source", "type", "start", "end", "prob", "na1", "na2", "na3")
signalp6 <- read_tsv("../output/web-download/signalp_6.0_result.gff3", comment = "#", col_names = gff.names, col_types = "ccciidccc")

#if("signalp" %in% names(seqInfo))
#  seqInfo <- select(seqInfo, -signalp)
#
#seqInfo <- left_join(seqInfo, select(signalp5, name = id, prob), by = c("name" = "name")) %>% 
#  mutate(signalp = !is.na(prob)) %>% select(-prob)

For GPI-anchor prediction, I used the PredGPI server.

tmp <- read_delim("../output/web-download/predgpi-result-headline-only.txt", delim = "|", col_names = c("name","fp","omega"))
Rows: 215 Columns: 3── Column specification ───────────────────────────────────────────────────────────────────────
Delimiter: "|"
chr (3): name, fp, omega
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pred.gpi <- tmp %>% 
  mutate(name = str_sub(name,2,-2), # remove > and the trailing space
         fp = as.numeric(str_sub(fp, 9, -2)), # extract the numeric part
         gpi.pred = fp <= 0.01,    # based on the cutoff of the PredGPI server (prob < 99% -> not GPI-anchored)
         omega = str_sub(omega, 8),
         cleaveRes = str_sub(omega, 1, 1),
         cleavePos = as.integer(str_sub(omega, 3))
         )
# remove the column if it already exists
#if("pred.gpi" %in% names(seqInfo))
#  seqInfo <- select(seqInfo, -pred.gpi)
#seqInfo <- left_join(seqInfo, select(pred.gpi, name, pred.gpi = is.gpi), by = c("name"="name"))

Combine the ML, SignalP and GPI-anchor predictions

adhesin <- blastInfo %>% 
  select(name, species, len = slen, pComplete) %>% 
  left_join(frv, by = "name") %>% 
  left_join(signalp6 %>% select(name, sp.prob = prob), by = "name") %>% 
  left_join(pred.gpi %>% select(name, gpi.pred, cleavePos), by = "name") %>% 
  mutate(sp.pred = !is.na(sp.prob)) %>% 
  relocate(c(frv.pred, sp.pred, gpi.pred), .after = pComplete)

Export the adhesin summary table

write_tsv(adhesin, file = "../output/table/20220818-Hil-family-size-adhesin-status-summary.tsv")

Plot the results alongside the species tree. First import the species tree

spsInfo <- read_tsv("../data/20220518-expanded-blast-species-info.tsv", col_types = cols())
sps.tree <- read.tree("../data/20220521-generax-species-tree.nwk") %>% 
  as_tibble() %>% 
  mutate(label = gsub("_", " ", label)) %>% 
  left_join(spsInfo, by = c("label" = "species")) %>% 
  as.treedata()
# to label the clades
clade <- c(
  MDR = MRCA(sps.tree, "Candida auris", "Candida duobushaemulonis"),
  CaLo = MRCA(sps.tree, "Candida parapsilosis", "Candida tropicalis"),
  glabrata = MRCA(sps.tree, "Candida glabrata", "Candida nivariensis")
)
sps.tree <- groupClade(sps.tree, clade)

Because we will plot the results separately, it is important to generate the species tree object and extract the order of the species, so as to match the numbers to the species.

p.tree <- ggtree(sps.tree, ladderize = FALSE) + xlim(0,2.2) + scale_y_reverse() +
  #geom_tiplab(aes(color = pathogen), as_ylab = TRUE) +
  geom_tiplab(size = 3.2, fontface = "italic", align = TRUE, linesize = 0.1, offset = 0.05) +
  geom_treescale(x = 0, width = 0.2, linesize = 1.2) +
  geom_hilight(node = clade["MDR"], fill = "#7F00FF", alpha = 0.15)  + # MDR
  geom_hilight(node = clade["CaLo"], fill = "pink", alpha = 0.25)    + # Candida/Lodderomyces
  geom_hilight(node = clade["glabrata"], fill = "steelblue", alpha = 0.15)  + # glabrata
  #geom_cladelabel(node = clade["MDR"],  label = "MDR", offset = 1.5,# color = "purple",
  #                offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# MDR
  #geom_cladelabel(node = clade["CaLo"],  label = "Candida/\nLodderomyces", offset = 1.47,# color = "hotpink2",
  #                offset.text = 0.1, angle = 270, hjust = .5, extend = 0.5, fontsize = 3.5) + # albicans
  #geom_cladelabel(node = clade["glabrata"],  label = "glabrata", offset = 1.38,# color = "steelblue", 
  #                offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# glabrata
  geom_tippoint(aes(color = pathogen)) +
  scale_color_manual(values =  c("crustacean" = "#6a5acd",
                                 "human" = "#d14949", 
                                 "human (rare)" = "steelblue",
                                 "no report" = "gray20")) +
  #guides(color = guide_legend(byrow = TRUE)) +
  theme(legend.position = c(0.12, 0.13))
Scale for 'y' is already present. Adding another scale for 'y', which will replace the
existing scale.

Summarize the results

df0 <- adhesin %>% 
  mutate(species = factor(species, levels = rev(get_taxa_name(p.tree)))) %>% 
  group_by(species) %>% 
  summarize(total = n(), FRV = sum(frv.pred), SP = sum(sp.pred), GPI = sum(gpi.pred), 
            final = sum(frv.pred & sp.pred & gpi.pred)) %>% 
  pivot_longer(total:final, names_to = "type", values_to = "number") %>% 
  mutate(type = factor(type, levels = unique(type))) %>% 
  # complete missing values for S. cerevisiae
  complete(species, type, fill = list(number = 0))

p <- ggplot(df0, aes(x = type, y = species)) + 
  scale_y_discrete(limits = rev) +
  geom_tile(aes(fill = number), color = "white", alpha = 0.4) +
  geom_text(aes(label = number), color = "black", size = 4) +
  scale_fill_distiller(palette = "Greys", direction = 1, limits = c(0, 20), oob = scales::squish) +
  scale_x_discrete(position = "top") +
  theme_cowplot() + theme(axis.title = element_blank(), axis.line = element_blank(), legend.position = "none")

ggsave(p, file = paste0("../output/img/",gsub("-", "", Sys.Date()), "-species-adhesin-prediction-summary.png"), width = 5, height = 7)
p1 <- p + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
plot_grid(p.tree, p1, rel_widths = c(3,2), scale = c(1, 0.95))

Summarize the % of Hil genes passing all three tests

df0 %>% pivot_wider(names_from = "type", values_from = "number") %>% 
  select(species, total, final) %>% 
  summarize(total = sum(total), adhesin = sum(final))

Short or no GPI-anchor

A subset of the identified Hil homologs are short and/or missing either the signal peptide or the GPI anchor. We need to first filter out ones that are incomplete (sequence record) before examining the remaining ones.

First, check the protein length distribution. M. bicuspidata is an outlier in that 27 of the 29 Hil homologs in this species are shorter than 500 aa and 10 were annotated as being incomplete in the RefSeq database.

adhesin %>% 
  mutate(`M. bicuspidata` = species == "Metschnikowia bicuspidata",
         protein = ifelse(grepl("no", pComplete), "Incomplete", "Complete")) %>% 
  ggplot(aes(x = len, fill = `M. bicuspidata`)) + 
  geom_histogram(binwidth = 100) + 
  geom_vline(xintercept = 600, linetype = 2, color = "gray20") +
  scale_fill_manual(values = c("gray50", "red3")) +
  facet_wrap(~protein, nrow = 2, labeller = "label_both") +
  xlab("Protein length") + ylab("Frequency") +
  theme_cowplot() + theme(legend.title = element_text(face = 3))
ggsave(file = "../output/img/20221002-homologs-length-distribution-Mb.png", width = 6, height = 4)

Filter out incomplete entries and focus on those that either are short or miss SP/GPI

# short and missing SP/GPI
adhesin %>% 
  filter(!grepl("no", pComplete), len < 600, len > 250) %>% 
  group_by(sp.pred, gpi.pred, species) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(group = paste0(ifelse(sp.pred, "SP+", "sp-"), ifelse(gpi.pred, "GPI+", "gpi-"))) %>% 
  select(group, species, n) %>% 
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>% 
  mutate(total = rowSums(across(`sp-gpi-`:`SP+GPI+`), na.rm = TRUE)) %>% 
  arrange(desc(total)) %>% 
  write_tsv(file = "../output/table/20221003-short-and-missing-SP-GPI.tsv")

# long and missing SP/GPI
adhesin %>% 
  filter(!grepl("no", pComplete), len >= 600) %>% 
  group_by(sp.pred, gpi.pred, species) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(group = paste0(ifelse(sp.pred, "SP+", "sp-"), ifelse(gpi.pred, "GPI+", "gpi-"))) %>% 
  select(group, species, n) %>% 
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>% 
  mutate(total = rowSums(across(`sp-gpi-`:`SP+GPI+`), na.rm = TRUE)) %>% 
  arrange(desc(total)) %>% 
  write_tsv(file = "../output/table/20221003-long-and-missing-SP-GPI.tsv")

# all length
adhesin %>% 
  filter(!grepl("no", pComplete)) %>% 
  mutate(length = cut(len, breaks = c(0, 250, 600, 5000), labels = c("0-250", "251-600", ">600"))) %>% 
  group_by(sp.pred, gpi.pred, length) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(group = paste0(ifelse(sp.pred, "SP+", "sp-"), ifelse(gpi.pred, "GPI+", "gpi-"))) %>% 
  select(group, length, n) %>% 
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>% 
  mutate(total = rowSums(across(`sp-gpi-`:`SP+GPI+`), na.rm = TRUE)) %>% 
  arrange(desc(total)) %>% 
  write_tsv(file = "../output/table/20221003-all-missing-SP-GPI.tsv")
  

Feature map for homologs

The goal is to produce a schematic plot for each homolog outlining their main features, such as the locations of the PFam domains (mainly the Hyp_reg_CWP), locations of the signal peptide and GPI-anchor, distribution of TANGO sequences. Note that all these features can be represented as a range with associated metadata. So the first step is to collect the coordinates of the features

Gene tree for organizing the features

# load the gene tree
gene.tree <- read.nhx(file = "../data/20220512-generax-clustalo-shen2018-wScer-gene-tree.nhx") %>% 
  drop.tip(tip = Mb.rm$name)
# add supplemental information
clades <- sps.tree %>% as_tibble() %>% select(treeName, group) %>% na.omit()
gene.tree <- left_join(gene.tree, select(spsInfo, treeName, family), by = c("S" = "treeName")) %>% 
  left_join(clades, by = c("S" = "treeName"))# %>% 
  #mutate(family = forcats::fct_relevel(family, "Metschnikowiaceae", after = Inf))
# label selected species to show the clustering
selected_nodes <- gene.tree %>% as_tibble() %>% 
  filter(S %in% c("Candida_albicans", "Candida_glabrata", "Candida_auris")) %>% 
  pull(node)
# color by family
clade.cols <- c(
  "CaLo" = "firebrick",
  "MDR" = "#7F00FF",
  "glabrata" = "steelblue"
)
p.gene.tree <- ggtree(gene.tree, size = 0.4) + xlim(0,4) + 
  #geom_nodepoint(aes(fill = D), data = td_filter(D == "Y"), shape = 21, size = 1, color = "red") + 
  geom_tippoint(aes(color = group), size = 1) +
  geom_tiplab(aes(color = family),
               align = TRUE, linesize = 0.1, size = 1, offset = 0.05) +
  geom_nodelab(aes(x = branch, label = node), size = 1) +
  #geom_cladelab(node = 357,  label = "M. bicuspidata", offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +
  #geom_text(label = "D", data = td_filter(D == "Y"), hjust = -0.4, size = 1.5, color = "red")# +

  scale_color_manual(name = "Clade", values = clade.cols)# +
p.gene.tree

ggsave(filename = "../output/img/20220916-gene-tree-side.png", width = 5, height = 7)

Extract gene tree order

genetreeOrder <- get_taxa_name(p.gene.tree)

Organize and combine the tandem repeats features

# summarize stats of tandem repeats
repeats <- tandem %>% 
  group_by(type, period) %>% 
  summarize(n = n(), copyMean = mean(copyN), .groups = "drop") %>% 
  mutate(length = period * copyMean)

tr <- tandem %>% 
  left_join(select(repeats, type, copyMean), by = c("type" = "type")) %>% 
  filter(name %in% genetreeOrder) %>% 
  mutate(type = paste0("TR-", type),
         tip = paste0(consensus_nogap,
                      "\ntype: ", type,
                      "\nperiod: ", period, 
                      "\ncopyN: ", copyMean),
         name = ordered(name, levels = rev(genetreeOrder))) %>% 
  select(name, type, start, end, tip)

Make a seqLen object to draw the overall length of each protein

seqLen <- blastInfo %>% 
  mutate(start = 1, name = ordered(name, levels = rev(genetreeOrder))) %>% 
  select(name, start, end = slen)
# GPI-anchor
# use pred.gpi
# feature set
# structure: id  feature  start  end
feature <- bind_rows(
  Hyphal_reg_CWP = pf11765,
  # extend the signal peptide segment by 10 amino acids to make it more visible
  `Signal Peptide` = signalp6 %>% mutate(end = end + 10) %>% select(name, start, end),
  # extend the GPI-anchor C-terminus segment by 20 amino acids to make it more visible
  `GPI-anchor` = pred.gpi %>% filter(gpi.pred) %>%
    left_join(select(blastInfo, name, slen), by = "name") %>% 
    mutate(start = cleavePos-10, end = slen) %>% 
    select(name, start, end),
  `Tandem Repeats` = tr %>% select(name, start, end),
  .id = "type"
) %>% filter(!name %in% Mb.rm$name)

feature <- feature %>% 
  mutate(name = ordered(name, levels = rev(genetreeOrder)),
         type = ordered(type, levels = c("Hyphal_reg_CWP", "Signal Peptide", "GPI-anchor", "Tandem Repeats")))
feature.colors <- c("Hyphal_reg_CWP" = "#3d85c6", "Signal Peptide" = "#cc0000", "GPI-anchor" = "#6a3d9a", "Tandem Repeats" = "#af8400bb")

Plot domain organization

h = 1.2
# plot
p0 <- ggplot(seqLen, aes(x = name, y = start, xend = name, yend = end)) + 
  geom_segment(color = "gray80", size = h)
p1 <- geom_segment(data = feature,  aes(color = type), size = h)
p2 <- geom_segment(data = tandem.div, aes(x = name, xend = name, y = div, yend = div + 1.5),
                   size = h, color = "white")
p3 <- p0 + p1 + coord_flip() + theme_classic() + 
  scale_color_manual(values = feature.colors) +
  scale_y_continuous(expand = expansion(mult = c(0.02, 0.02))) +
  theme(axis.text.y = element_text(size = 3), axis.title.y = element_blank(),
        axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
        legend.position = c(0.8, 0.9), 
        legend.text = element_text(size = 12), legend.title = element_text(size = 14),
        plot.title = element_text(hjust = 0.5), 
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  labs(y = "Position in sequence", x = "Sequences", color = "FEATURES")
#plot_grid(p.gtree, p3, ncol = 2, align = 'v', rel_widths = c(1,2), scale = c(1.01,1))
ggsave("../output/img/20220916-domain-tandem-repeats.png", plot = p3, width = 6, height = 7.5)

Fig. ? Blue boxes indicate the PF11765 domains while all other non-grey boxes indicate XSTREAM-determined tandem repeat domains. Colors are used to group highly similar tandem repeats. The black thin lines demarcate adjacent tandem repeat units. The table below shows the copy number, period and consensus sequence for each tandem domain organized by the host sequences.

Separate TR types

#DT::datatable(
#  tandem %>% 
#    dplyr::rename(seqL = seqLength, err = consensus_error, seq = consensus_nogap) %>% 
#    select(-seqAlign, -type, -consensus_gap, -seq, seq) %>% 
#    arrange(desc(name)),
#  fillContainer = FALSE, options = list(pageLength = 10)
#)

Repeat the above analysis but distinguishing between all different TR types

require(RColorBrewer)
Loading required package: RColorBrewer
tr.th <- 100 # arbitrary threshold for distinguishing "short" from "long" TRs
tr.col <- character(nrow(repeats)) # create a color vector
short.rp <- which(repeats$length < tr.th) # identify the short repeats indices
long.rp <- setdiff(1:nrow(repeats), short.rp) # the long repeats indices
set.seed(123) # for reproducibly shuffling the order before assigning the colors
short.rp <- sample(short.rp) # shuffle the indices for short tandem repeats
tr.col[short.rp] <- colorRampPalette(brewer.pal(12, "Paired")[seq(3,11,by=2)])(length(short.rp)) # assign the short repeats a lower contrast color
set.seed(231) # for reproducibly shuffling the order before assigning the colors
long.rp <- sample(long.rp)
tr.col[long.rp] <- colorRampPalette(brewer.pal(12, "Paired")[seq(4,12,by=2)])(length(long.rp)) # assign the long repeats a higher contrast color
# -- desaturate the colors -- 
# https://stackoverflow.com/questions/26314701/r-reducing-colour-saturation-of-a-colour-palette
library(colorspace)   ## hsv colorspace manipulations

## Function for desaturating colors by specified proportion
desat <- function(cols, sat=0.5) {
    X <- diag(c(1, sat, 1)) %*% rgb2hsv(col2rgb(cols))
    hsv(X[1,], X[2,], X[3,])
}

tr.col <- desat(tr.col, sat = 0.8)
# -- finish --
tr.col <- paste0(tr.col, "CC") # add 20% transparency to the TR features
names(tr.col) <- paste0("TR-", repeats$type) # name the colors by the TR types
repeats$color <- tr.col

Combine domains, SP and GPI-anchor with TR features.

# combine sequence features with tandem repeats
feature1 <- bind_rows(filter(feature, type != "Tandem Repeats"), tr) %>% 
  mutate(type = ordered(type, levels = c("Hyphal_reg_CWP", "SignalP", "GPI-anchor", unique(tr1$type))))
Error in `mutate()`:
! Problem while computing `type = ordered(...)`.
Caused by error in `h()`:
! error in evaluating the argument 'x' in selecting a method for function 'unique': object 'tr1' not found
Backtrace:
  1. ... %>% ...
 10. base::.handleSimpleError(`<fn>`, "object 'tr1' not found", base::quote(unique(tr1$type)))
 11. base h(simpleError(msg, call))
# plot
p1 <- geom_segment(data = feature1, aes(color = type, text = tip), size = h)
p2 <- geom_segment(data = tandem.div, aes(x = name, xend = name, y = div, yend = div + 1.5),
                   size = h, color = "white")
p3 <- p0 + p1 + coord_flip() + theme_classic() + 
  scale_color_manual(values = feature.col1) +
  scale_y_continuous(expand = expansion(mult = c(0.02, 0.02))) +
  theme(axis.text.y = element_text(size = 3), axis.title.y = element_blank(),
        axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
        legend.position = "none", plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  labs(y = "Position in sequence", x = "Sequences", color = "FEATURES")
# plot_grid(p.gtree, p3 + p2, ncol = 2, align = 'v', rel_widths = c(1,2), scale = c(1.01,1))
ggsave("../output/img/20220916-domain-tandem-repeats-distinct-color.png", plot = p3, width = 5, height = 7.5)
# plot
#require(plotly)
plotly::ggplotly(p3, tooltip = "text", width = 900, height = 900)
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

Tango predicted β-aggregation sequences

# reorder the sequences for plotting
# plot
#p1 <- ggplot(seqLen, aes(x = name, y = start, xend = name, yend = end)) + geom_segment(color = "gray80", size = 2)
p1 <- geom_segment(data = filter(pf11765, !name %in% Mb.rm$name), 
                   aes(x = name, xend = name, y = start+1, yend = end), size = h, color = "#3d84c6")
p2 <- geom_segment(data = filter(tango, !name %in% Mb.rm$name), 
                   aes(x = name, xend = name, y = ifelse(start-4 >= 0, start-4, 0), yend = end + 4, color = median), size = h)
p3 <- p0 + p1 + p2 + coord_flip() + theme_classic() + 
  scale_color_viridis_c(limits = c(5,101), breaks = c(5,seq(25,100,25)), direction = -1) +
  theme(axis.text.y = element_text(size = 4), axis.title.y = element_blank(),
        axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.line.x = element_blank(), axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(0.8,0.88),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  ylim(-2, 4500) + labs(y = "Position in sequence", x = "Sequences", color = "TANGO score")
p3

##plot_grid(p.gtree, p4, ncol = 2, align = 'v', rel_widths = c(1,2.5), scale = c(1.01,1))
ggsave("../output/img/20220916-tango-score-segment.png", plot = p3, width = 6, height = 7.5)

To compare the number of TANGO motifs in C. auris Hil1-4 and their close homologs to the rest of the Hil family, we first identify the first group of sequences

a <- which(genetreeOrder == "XP_025344416.1_Candida_haemuloni", arr.ind = TRUE)
b <- which(genetreeOrder == "XP_024716365.1_Candida_pseudohaemulonii", arr.ind = TRUE)
hil1_4_mdr <- genetreeOrder[a:b]
rm(list = c("a", "b"))
motif.per.seq %>% 
  select(name, n.all) %>% 
  mutate(group = ifelse(name %in% hil1_4_mdr, "Hil1-4_MDR", "others")) %>% 
  group_by(group) %>% 
  summarize(median = median(n.all))

Discussion

  • The one sequence below 500 a.a. is from N. delphensis, which is labeled as a partial CDS.
  • Majority of the proteins in the list are 500-2000 a.a., with a few exceptionally long
  • Not only do Saccharomycetaceae species have fewer Hil family homologs, the ones they have are also short (< 1000 a.a.) with the exception of C. glabrata

Chromosomal locations

Are members of this protein family enriched in the subtelomeric regions?

A recent long-read sequencing study for C. glabrata annotated 31 novel ORFs, of which 24 are GPI-Cell Wall Proteins. The authors cited previous literature supporting the in the subtelomeric regions Xu 2020. While performing BLAST on FungiDB to identify homologs of our protein, I also noticed that many of the hits appear to be at the beginning and end of the chromosomes.

To test if there is indeed a significant enrichment among this family of proteins in the subtelomeric regions, I collected the chromosomal locations for a subset of the species whose genomes were assembled to a chromosomal or complete genome level (the two are somewhat equivalent). These include Candida albicans, Candida auris, Candida dubliniensis, Candida orthopsilosis, Debaryomyces hansenii, Kazachstania africana, Kazachstania naganishii, Kluyveromyces lactis, Naumovozyma castellii, Naumovozyma dairenensis, Candida glabrata, Scheffersomyces stipitis

To formally test this hypothesis, I need to account for the background gene density differences along the chromosomes. The idea is to compare the chromosomal positions for this group of proteins compared with the gene densities on the chromosomes they reside on.

One consideration for this analysis is that if we simply use all the species for which a well assembled genome is available, we may end up over-sampling a phylogenetic clade and having its Hil family’s localization over-weighted in the final test. To avoid giving too much weight on a group of species, I will select a representative species if more than one is available in a relatively closely related clade. For example, we will use Candida albicans and remove Candida dubliniensis.

# import the chromosomal location information
chrLoc <- read_tsv("../data/20220919-expanded-blast-chromosomal-locations.tsv", col_types = cols())
# decide the species to be used for the analysis
use.sps <- c("Candida albicans", "Debaryomyces hansenii", "Candida orthopsilosis",
             "Kazachstania africana", "Kluyveromyces lactis",
             "Naumovozyma dairenensis", "Candida auris", "Candida glabrata")
# create a new tibble for our homologs from these species
fg.freq <- chrLoc %>% filter(species %in% use.sps)
# import assembly info
assembly.info <- read_tsv("../data/20220525-expanded-blast-species-assembly-info.tsv",
                          col_types = cols())
use.sps <- assembly.info %>% 
  filter(species %in% use.sps) %>%
  filter(!species %in% c("Candida glabrata", "Candida auris", "Kluyveromyces lactis")) %>% 
  select(species, assembly = assemblyaccession) %>% 
  # manually add several species. note that we are using the B11245 (Clade 4) strain to 
  # represent C. auris, because it is assembled to a chromosome level and has annotation files
  add_row(species = c("Candida glabrata", "Candida auris", "Kluyveromyces lactis"),
          assembly = c("GCA_010111755.1", "GCA_008275145.1", "GCF_000002515.2"))

To conduct this test, we first need to prepare and compute the background gene densities. To do this, we will gather the genome assembly files for the selected species, read them into R, and generate a table that contains one row for each gene, with its gene ID, chromosome number, chromosome length and the start position expressed as a percentage measured from the chromosome ends.

# 1. prepare file names
#   get all file names that ends with "feature_table.txt.gz", which contain the gene annotation
feature.files <- list.files(path = "../data/assembly-info/", pattern = "*feature_table.txt.gz$")
names(feature.files) <- sapply(str_split(feature.files, pattern = "_"), function(x) {
  paste(x[1], x[2], sep = "_")
})
# get all file names that ends with "assembly_report.txt", which contain the chromosomal length
assembly.files <- list.files(path = "../data/assembly-info/", pattern = "*assembly_report.txt$")
names(assembly.files) <- sapply(str_split(assembly.files, pattern = "_"), function(x) {
  paste(x[1], x[2], sep = "_")
})
use.sps$FeatureFile <- feature.files[use.sps$assembly]
use.sps$AssemblyFile <- assembly.files[use.sps$assembly]
# 2. read in the assembly information
feature.col.names <- c("feature","class","assembly","assembly_unit","seq_type","chromosome","genomic_accession","start","end","strand","product_accession","non_redundant_refseq","related_accession","name","symbol","GeneID","locus_tag","feature_interval_length","product_length","attributes")
assembly.col.names <- c("chromosome","seq_role","assign_molecule","type","gb_acc","relationship",
                        "refseq_acc","assembly_unit","seqL","ucsc_name")
compute.bg.freq <- function(row){
  assembly.file <- row["AssemblyFile"]
  assembly <- read_tsv(paste0("../data/assembly-info/",assembly.file), comment = "#",
                       col_names = assembly.col.names, col_types = "ccccccccic")
  feature.file <- row["FeatureFile"]
  feature <- read_tsv(paste0("../data/assembly-info/",feature.file), col_names = feature.col.names,
                      col_types = "ccccccciicccccccciic", skip = 1)
  res <- feature %>% 
    filter(feature == "mRNA") %>% 
    # these feature tables are organized hierarchically, with the top level being "gene"
    # the next level one of "mRNA", "ncRNA", "tRNA" or "rRNA". we only count protein-coding genes, i.e.
    # "mRNA". the reason I didn't select the "CDS" feature type is because in a small number of cases,
    # one mRNA feature contains more than one CDS feature, possibly due to splicing or alternative 
    # translational start site
    select(chromosome, start, end) %>% 
    left_join(assembly %>%
                mutate(chraccver = ifelse(refseq_acc != "na", refseq_acc, gb_acc)) %>% 
                select(chromosome, chraccver, seqL), 
              by = c("chromosome" = "chromosome")) %>%
    mutate(relLoc = round(start / seqL, 3))
  return(res)
}
# 3. apply the function to the genomes, but leave out C. auris
bg.freq <- apply(use.sps, MARGIN = 1, compute.bg.freq)
names(bg.freq) <- use.sps$species

Let’s take a look at the gene density in one of the genomes, e.g. D. hansenii

Below I use the geom_histogram function instead, with breaks specified manually. This results in a density distribution that is pretty flat across the chromosomes.

bg.freq$`Debaryomyces hansenii` %>% ggplot(aes(x = relLoc)) + 
  geom_histogram(breaks = seq(0,1,0.05)) +
  facet_wrap(~ chromosome) + ggtitle("D. hansenii") + theme(plot.title = element_text(hjust = 0.5))

In order to compare the distribution of all genes in different bins of the chromosomes to the homologs in our case study, we can divide each chromosome in the nrow(use.sps) genomes into an arbitrary number of bins after “folding” them in half, e.g. 0-10%, 10-20%, 20-30%, 30-40% and 40-50%. To be able to visually compare the distribution of our homologs and the genome background, we will create a special “chromosome” class that will be our homologs and combine them with the bg.freq table.

freq.bins <- c(-0.001, seq(0.1, 0.5, 0.1)); freq.binsL <- c(0, freq.bins[-1])
freq.label <- paste0(head(freq.binsL, -1)*100,"-",tail(freq.binsL, -1)*100,"%")
bg.freq.tb <- bind_rows(bg.freq, .id = "species") %>% 
  mutate(fold.relLoc = ifelse(relLoc <= 0.5, relLoc, 1-relLoc),
         bin = cut(fold.relLoc, breaks = freq.bins, labels = freq.label))

# add the homologs
fg.freq <- fg.freq %>% 
  mutate(fold.relLoc = ifelse(relLoc <= 0.5, relLoc, 1-relLoc),
         bin = cut(fold.relLoc, breaks = freq.bins, labels = freq.label)) 
freq.plot <- fg.freq %>%
  mutate(chromosome = "Hil") %>%  # we label the homologs as Hil to make it a separate class
  select(species = species, chromosome, bin) %>% 
  bind_rows(select(bg.freq.tb, species, chromosome, bin)) %>% 
  mutate(chromosome = ordered(chromosome, levels = c(1:12,LETTERS[1:13],"Hil"))) %>% 
  group_by(species) %>% 
  filter(sum(chromosome == "Hil") >=3) # remove species that have fewer than 3 Hil homologs
# plot
freq.plot %>% 
  #mutate(species = paste0(substr(species, 1, 1), ". ", substr(species, 2, 15))) %>% 
  ggplot(aes(x = chromosome, group = bin, fill = bin)) + 
  geom_bar(position = position_fill()) +
  facet_wrap(~ species, scales = "free_x", ) + 
  #scale_fill_brewer("Distance from\nchromosome end", type = "qual", palette = 3) +
  scale_fill_viridis_d(direction = -1, end = 0.95, alpha = 0.9) +
  scale_y_continuous(name = "Cumulative % of genes", trans = "reverse", breaks = seq(0,1,0.2)) + 
  theme_cowplot() + panel_border(color = "grey80") +
  theme(legend.position = "none", strip.background = element_blank(),
        strip.text.x = element_text(face = 3),
        axis.text = element_text(size = rel(0.7)))

ggsave("../output/img/20220920-compare-homologs-chromosomal-locations-to-bg.png", width = 7, height = 4)

In the plot above, we can see that the distribution of the Hil proteins on the chromosome deviate from the background. Only species with three or more Hil family members are shown here.

Now that we have the background gene frequencies computed, we can start constructing a test that tests for significant departure in the chromosomal locations of our protein family from the background frequencies. One idea is to divide the chromosome into equal-sized bins and use the frequencies of all genes in each bin as the multinomial probabilities in the null hypothesis. If our proteins were randomly selected from each chromosome without regard to their location, we would expect their locations to conform to the background frequencies. This can be tested using an exact multinomial test or an approximate Chi-square test or G-test. Since we don’t distinguish the two ends of a chromosome, we can “fold” the chromosome “in half” and measure each gene’s location as a percentage from the closer end, i.e. the relative location ranging from 0%-50%. If we can reject the null hypothesis, that constitutes evidence that the proteins from our group are not randomly selected from the background set.

Result of multinomial test

# install XNomial package for the multinomial exact test function
# https://cran.r-project.org/web/packages/XNomial/vignettes/XNomial.html#e1
while(!require(XNomial))
  suppressMessages(install.packages("XNomial"))
Loading required package: XNomial
# calculate pooled background frequency
bg.cnt <- tabulate(bg.freq.tb$bin)
fg.cnt <- tabulate(fg.freq$bin)
xmulti(obs = fg.cnt, expr = bg.cnt, detail = 3)

P value  (LLR)  =  1.301e-12
P value (Prob)  =  4.7e-13
P value (Chisq) =  2.359e-13

Observed:  34 7 2 8 1 
Expected ratio:  8602 9027 9021 9090 9028 
Total number of tables:  367290 

Note that the three P-values correspond to three approaches of testing the goodness-of-fit. The LLR, which stands for Log Likelihood Ratio, is generally preferred. But all three said the same thing: the observation deviates from the background frequency significantly. By looking at the plots above, it is clear that the deviation is due to the excess of our homologs residing in the 0-10% bin, which is the tip of the chromosomes.

The test above assumes that the gene density along all chromosomes in the eight species follow the same distribution. The empirical observation supports that hypothesis. Should that to be not the case, we could also account for the variability in gene density distribution between species or even between chromosomes within a species. The idea is to first collect the chromosomes that our homologs come from, and then randomly sample the same number of genes as the number of our homologs on that chromosome. Do this for all of the homolog-containing chromosomes, we would obtain one random sample. Repeat this process 1000 times or more, we will then get the pseudosample, which we can use to compare with our observations. Here we need to come up with a statistic that summarizes each of our observation or pseudosample. For example, we could calculate the median % location for each sample, and ask where our observation lie relative to the pseudosamples. I’ll skip this approach for now since the first appraoch appears to be OK given the similar gene density distribution across chromosomes and species.

---
title: "Analyze XP_028889033 family evolution and adhesin properties"
author: "Bin He"
date: "2022-06-12 (updated `r Sys.Date()`)"
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
    toc_depth: 5
    code_folding: hide
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r load_libraries, echo = FALSE}
#if (!requireNamespace("treeio", quietly = TRUE))
#    BiocManager::install("treeio")
#if (!requireNamespace("ggtree", quietly = TRUE))
#    BiocManager::install("ggtree")
#if (!requireNamespace("GenomicRanges", quietly = TRUE))
#    BiocManager::install("GenomicRanges")
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(ggtree))
suppressPackageStartupMessages(library(treeio))
suppressPackageStartupMessages(require(GenomicRanges))
```

# Goal

Analyze the adhesin properties of the Hyr/Iff-like (Hil) family in the Sacchromycetes
This is version 4 of the analysis, using the expanded blast hits on 2022-05

# Adhesin predictions
The goal of the first analysis is to assess the evidence for each Hil family member as encoding yeast adhesins. We rely on a ML-based prediction algorithm and the following known sequence features of adhesins: signal peptide + GPI-anchor, tandem repeats, high S/T frequency (possibly glycosylation)

## Basic information
First get the basic information about the sequences in this study.
```{r load_seq_info}
#sps.list <- c("Cduobushaemulonis","Cpseudohaemulonis","Chaemuloni","Cauris","Clusitaniae","Dhansenii","Cparapsilosis","Lelongisporus","Ctropicalis","Cdubliniensis","Calbicans","Sstipitis","Klactis","Ncastellii","Cglabrata","Nbracarensis","Ndelphensis","Nnivariensis")
blastInfo <- read_tsv("../data/expanded-blast-homologs-info.tsv", col_types = cols())# %>% 
#  mutate(species_id = factor(species, levels = sps.list), species = NULL)
```

## ML adhesin predictions
[FungalRV](http://fungalrv.igib.res.in/) is a Support Vector Machine (SVM) based prediction algorithms that use sequence features such as amino acid composition (frequency, physiochemical properties etc.) as input and train Machine Learning models to distinguish fungal adhesins from non-adhesins.
```{r adhesin_prediction}
frv.th = 0.511 # recommended FungalRV score threshold
frv <- read_tsv("../output/FungalRV/fungalRV-results.txt", skip = 3, col_names = c("name","frv.score"), col_types = "cd") %>% 
  mutate(name = str_sub(name, 2), frv.pred = frv.score > frv.th)
#if("frv.score" %in% names(seqInfo))
#  seqInfo <- select(seqInfo, -frv.score, -frv.pred, -faa.score, -faa.pred)
#seqInfo <- seqInfo %>% left_join(frv) %>% left_join(faa)
```

## SP and GPI-anchor prediction
GPI-anchored proteins are characterized by an N-terminal signal peptide, which would direct the protein to the secretary pathway, and a C-terminal GPI-anchor peptide, which would be cleaved and replaced by the GPI-anchor, allowing the protein to be tethered to the cell wall.

For signal peptide, I used the [SignalP server](https://services.healthtech.dtu.dk/service.php?SignalP). Its latest version is 6.0.

```{r signalP, fig.width=5, fig.height=5}
# Signal peptide
gff.names <- c("name", "source", "type", "start", "end", "prob", "na1", "na2", "na3")
signalp6 <- read_tsv("../output/web-download/signalp_6.0_result.gff3", comment = "#", col_names = gff.names, col_types = "ccciidccc")

#if("signalp" %in% names(seqInfo))
#  seqInfo <- select(seqInfo, -signalp)
#
#seqInfo <- left_join(seqInfo, select(signalp5, name = id, prob), by = c("name" = "name")) %>% 
#  mutate(signalp = !is.na(prob)) %>% select(-prob)
```

For GPI-anchor prediction, I used the [PredGPI server](http://gpcr.biocomp.unibo.it/predgpi/).
```{r gpi}
tmp <- read_delim("../output/web-download/predgpi-result-headline-only.txt", delim = "|", col_names = c("name","fp","omega"))
pred.gpi <- tmp %>% 
  mutate(name = str_sub(name,2,-2), # remove > and the trailing space
         fp = as.numeric(str_sub(fp, 9, -2)), # extract the numeric part
         gpi.pred = fp <= 0.01,    # based on the cutoff of the PredGPI server (prob < 99% -> not GPI-anchored)
         omega = str_sub(omega, 8),
         cleaveRes = str_sub(omega, 1, 1),
         cleavePos = as.integer(str_sub(omega, 3))
         )
# remove the column if it already exists
#if("pred.gpi" %in% names(seqInfo))
#  seqInfo <- select(seqInfo, -pred.gpi)
#seqInfo <- left_join(seqInfo, select(pred.gpi, name, pred.gpi = is.gpi), by = c("name"="name"))
```

## Combine the ML, SignalP and GPI-anchor predictions
```{r}
adhesin <- blastInfo %>% 
  select(name, species, len = slen, pComplete) %>% 
  left_join(frv, by = "name") %>% 
  left_join(signalp6 %>% select(name, sp.prob = prob), by = "name") %>% 
  left_join(pred.gpi %>% select(name, gpi.pred, cleavePos), by = "name") %>% 
  mutate(sp.pred = !is.na(sp.prob)) %>% 
  relocate(c(frv.pred, sp.pred, gpi.pred), .after = pComplete)
```

Export the adhesin summary table
```{r}
write_tsv(adhesin, file = "../output/table/20220818-Hil-family-size-adhesin-status-summary.tsv")
```

Plot the results alongside the species tree. First import the species tree
```{r}
spsInfo <- read_tsv("../data/20220518-expanded-blast-species-info.tsv", col_types = cols())
sps.tree <- read.tree("../data/20220521-generax-species-tree.nwk") %>% 
  as_tibble() %>% 
  mutate(label = gsub("_", " ", label)) %>% 
  left_join(spsInfo, by = c("label" = "species")) %>% 
  as.treedata()
# to label the clades
clade <- c(
  MDR = MRCA(sps.tree, "Candida auris", "Candida duobushaemulonis"),
  CaLo = MRCA(sps.tree, "Candida parapsilosis", "Candida tropicalis"),
  glabrata = MRCA(sps.tree, "Candida glabrata", "Candida nivariensis")
)
sps.tree <- groupClade(sps.tree, clade)
```

Because we will plot the results separately, it is important to generate the species tree object and extract the order of the species, so as to match the numbers to the species.
```{r}
p.tree <- ggtree(sps.tree, ladderize = FALSE) + xlim(0,2.2) + scale_y_reverse() +
  #geom_tiplab(aes(color = pathogen), as_ylab = TRUE) +
  geom_tiplab(size = 3.2, fontface = "italic", align = TRUE, linesize = 0.1, offset = 0.05) +
  geom_treescale(x = 0, width = 0.2, linesize = 1.2) +
  geom_hilight(node = clade["MDR"], fill = "#7F00FF", alpha = 0.15)  + # MDR
  geom_hilight(node = clade["CaLo"], fill = "pink", alpha = 0.25)    + # Candida/Lodderomyces
  geom_hilight(node = clade["glabrata"], fill = "steelblue", alpha = 0.15)  + # glabrata
  #geom_cladelabel(node = clade["MDR"],  label = "MDR", offset = 1.5,# color = "purple",
  #                offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# MDR
  #geom_cladelabel(node = clade["CaLo"],  label = "Candida/\nLodderomyces", offset = 1.47,# color = "hotpink2",
  #                offset.text = 0.1, angle = 270, hjust = .5, extend = 0.5, fontsize = 3.5) + # albicans
  #geom_cladelabel(node = clade["glabrata"],  label = "glabrata", offset = 1.38,# color = "steelblue", 
  #                offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# glabrata
  geom_tippoint(aes(color = pathogen)) +
  scale_color_manual(values =  c("crustacean" = "#6a5acd",
                                 "human" = "#d14949", 
                                 "human (rare)" = "steelblue",
                                 "no report" = "gray20")) +
  #guides(color = guide_legend(byrow = TRUE)) +
  theme(legend.position = c(0.12, 0.13))
```

Summarize the results
```{r}
df0 <- adhesin %>% 
  mutate(species = factor(species, levels = rev(get_taxa_name(p.tree)))) %>% 
  group_by(species) %>% 
  summarize(total = n(), FRV = sum(frv.pred), SP = sum(sp.pred), GPI = sum(gpi.pred), 
            final = sum(frv.pred & sp.pred & gpi.pred)) %>% 
  pivot_longer(total:final, names_to = "type", values_to = "number") %>% 
  mutate(type = factor(type, levels = unique(type))) %>% 
  # complete missing values for S. cerevisiae
  complete(species, type, fill = list(number = 0))

p <- ggplot(df0, aes(x = type, y = species)) + 
  scale_y_discrete(limits = rev) +
  geom_tile(aes(fill = number), color = "white", alpha = 0.4) +
  geom_text(aes(label = number), color = "black", size = 4) +
  scale_fill_distiller(palette = "Greys", direction = 1, limits = c(0, 20), oob = scales::squish) +
  scale_x_discrete(position = "top") +
  theme_cowplot() + theme(axis.title = element_blank(), axis.line = element_blank(), legend.position = "none")

ggsave(p, file = paste0("../output/img/",gsub("-", "", Sys.Date()), "-species-adhesin-prediction-summary.png"), width = 5, height = 7)
```
```{r fig.width = 8, fig.height = 6}
p1 <- p + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
plot_grid(p.tree, p1, rel_widths = c(3,2), scale = c(1, 0.95))
```
Summarize the % of Hil genes passing all three tests
```{r}
df0 %>% pivot_wider(names_from = "type", values_from = "number") %>% 
  select(species, total, final) %>% 
  summarize(total = sum(total), adhesin = sum(final))
```


## Short or no GPI-anchor
A subset of the identified Hil homologs are short and/or missing either the signal peptide or the GPI anchor. We need to first filter out ones that are incomplete (sequence record) before examining the remaining ones.

First, check the protein length distribution. _M. bicuspidata_ is an outlier in that 27 of the 29 Hil homologs in this species are shorter than 500 aa and 10 were annotated as being incomplete in the RefSeq database.
```{r}
adhesin %>% 
  mutate(`M. bicuspidata` = species == "Metschnikowia bicuspidata",
         protein = ifelse(grepl("no", pComplete), "Incomplete", "Complete")) %>% 
  ggplot(aes(x = len, fill = `M. bicuspidata`)) + 
  geom_histogram(binwidth = 100) + 
  geom_vline(xintercept = 600, linetype = 2, color = "gray20") +
  scale_fill_manual(values = c("gray50", "red3")) +
  facet_wrap(~protein, nrow = 2, labeller = "label_both") +
  xlab("Protein length") + ylab("Frequency") +
  theme_cowplot() + theme(legend.title = element_text(face = 3))
ggsave(file = "../output/img/20221002-homologs-length-distribution-Mb.png", width = 6, height = 4)
```

Filter out incomplete entries and focus on those that either are short or miss SP/GPI
```{r}
# short and missing SP/GPI
adhesin %>% 
  filter(!grepl("no", pComplete), len < 600, len > 250) %>% 
  group_by(sp.pred, gpi.pred, species) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(group = paste0(ifelse(sp.pred, "SP+", "sp-"), ifelse(gpi.pred, "GPI+", "gpi-"))) %>% 
  select(group, species, n) %>% 
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>% 
  mutate(total = rowSums(across(`sp-gpi-`:`SP+GPI+`), na.rm = TRUE)) %>% 
  arrange(desc(total)) %>% 
  write_tsv(file = "../output/table/20221003-short-and-missing-SP-GPI.tsv")

# long and missing SP/GPI
adhesin %>% 
  filter(!grepl("no", pComplete), len >= 600) %>% 
  group_by(sp.pred, gpi.pred, species) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(group = paste0(ifelse(sp.pred, "SP+", "sp-"), ifelse(gpi.pred, "GPI+", "gpi-"))) %>% 
  select(group, species, n) %>% 
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>% 
  mutate(total = rowSums(across(`sp-gpi-`:`SP+GPI+`), na.rm = TRUE)) %>% 
  arrange(desc(total)) %>% 
  write_tsv(file = "../output/table/20221003-long-and-missing-SP-GPI.tsv")

# all length
adhesin %>% 
  filter(!grepl("no", pComplete)) %>% 
  mutate(length = cut(len, breaks = c(0, 250, 600, 5000), labels = c("0-250", "251-600", ">600"))) %>% 
  group_by(sp.pred, gpi.pred, length) %>% 
  summarize(n = n(), .groups = "drop") %>% 
  mutate(group = paste0(ifelse(sp.pred, "SP+", "sp-"), ifelse(gpi.pred, "GPI+", "gpi-"))) %>% 
  select(group, length, n) %>% 
  pivot_wider(names_from = group, values_from = n, values_fill = 0) %>% 
  mutate(total = rowSums(across(`sp-gpi-`:`SP+GPI+`), na.rm = TRUE)) %>% 
  arrange(desc(total)) %>% 
  write_tsv(file = "../output/table/20221003-all-missing-SP-GPI.tsv")
  
```

# Adhesin-related characteristics

**Update 2022-09-16**

Remove _M. bicuspidata_ sequences from consideration, due to a large number of them being incomplete and possibly skewing the results.

```{r}
Mb.rm <- blastInfo %>% 
  filter(species == "Metschnikowia bicuspidata") %>% 
  select(pid, name)
```

## Serine/Threonine content

S/T sites are potential sites for O-glycosylation, which could increase the rididity of the stalk of the protein and allow the N-terminal domain to protrude out of the cell wall facing the exterior. More evidence for the importance of O-glycosylation in a serine/threonine-rich domain can be found [here](https://ec.asm.org/content/10/10/1317.long).

The goal here is to compare the S/T frequencies in the Hil proteins to the proteome average. First we need to calculate the S/T frequencies in the Hil proteins, either in the whole protein or excluding the PF11765 domain. This is done with a script in the `script` folder named `Hil-ST-freq.sh`

Read in the S/T frequencies
```{r}
ST <- list(
  full = read_tsv("../output/ST-freq/20220623-Hil-full-STfreq.out.gz", col_types = "ciii"),
  noNTD = read_tsv("../output/ST-freq/20220623-Hil-noPF11765-STfreq.out.gz", col_types = "ciii")
)
```

To determine the background frequency of Serine and Threonine in the proteome(s), I modified the `calc_aafreq_gz.py` script I wrote a long time ago for calculating the cystein and dibasic residues. I then copied four proteome fasta files, for _C. albicans_, _C. glabrata_, _S. cerevisiae_ and _C. auris_ and applied the script on them (using a wrapper script called `S-T-freq.sh`). Below I will look at both the proteome average and the distribution of S/T in individual proteins across the proteome.
```{r}
tmp.files <- list.files(path = "../output/ST-freq/", pattern = "*ST-freq.tsv.gz")
files <- file.path("../output/ST-freq", tmp.files)
names(files) <- gsub("-", " ", tmp.files) %>% word(1, 1)
bgST.freq <- files %>% 
  map(~read_tsv(., col_types = cols())) %>% 
  bind_rows(.id = "Species") %>% 
  mutate(freqS = Ser/length, freqT = Thr/length,
         Ser = NULL, Thr = NULL,
         Species = paste0(str_sub(Species, 1, 1), ". ", str_sub(Species, 2))) %>% 
  pivot_longer(cols = starts_with("freq"), names_to = "Residue", names_prefix = "freq", values_to = "frequency")
```

Is there any correlations between protein length and S/T frequency?
```{r}
bgST.freq %>% 
  ggplot(aes(x = length, y = frequency)) +
  geom_hex() + facet_grid(Residue ~ Species, scales = "free_y") + 
  scale_x_log10() + xlab("Protein length")
```
Doesn't seem to be significant.

Now let's combine the Hil protein S/T frequences as an extra "species" into the `bgST.freq`
```{r}
tmp <- ST$full %>% 
  filter(!ID %in% Mb.rm$name) %>% 
  mutate(Species = "Hil family", `S` = Ser/length, `T` = Thr/length) %>% 
  pivot_longer(cols = c(`S`, `T`), names_to = "Residue", values_to = "frequency") %>% 
  select(Species, ID, length, Residue, frequency) %>% 
  bind_rows(filter(bgST.freq, Species != "S. cerevisiae")) %>% 
  mutate(Residue = factor(Residue, levels = c("S", "T"), labels = c("Ser", "Thr")))
#tmp <- ST$noNTD %>% 
#  mutate(Species = "Hil-PF11765", S = Ser/length, `T` = Thr/length) %>% 
#  pivot_longer(cols = c(S, `T`), names_to = "Residue", values_to = "frequency") %>% 
#  select(Species, ID, length, Residue, frequency) %>%
#  bind_rows(tmp)

p <- tmp %>% 
  ggplot(aes(x = Species, y = frequency, fill = Residue)) + 
  geom_boxplot(position = position_dodge(0.9), outlier.size = 0.2, outlier.alpha = 0.5) + 
  scale_fill_manual(values = c("Thr" = "skyblue3", "Ser" = "lightblue1")) +
  scale_y_continuous(limits = c(0, 0.4), oob = scales::squish) +
  theme_cowplot() + panel_border(color = "black") +
  theme(axis.title.x = element_blank(), axis.text.x = element_text(face = 3))

p# + p1 + scale_color_manual(values = c("S" = alpha("gray20", 0.5), "T" = alpha("gray20", 0.5)))
ggsave("../output/img/20220623-Hil-ST-freq-compared-to-proteome.png", width = 5, height = 2.5)
```


```{r}
bgST.freq %>% 
  group_by(Species, Residue) %>% 
  summarize(mean = round(mean(frequency),3)) %>% 
  pivot_wider(names_from = Residue, values_from = mean)
```

<!-- ### Sliding window estimates -->
<!-- To determine the S/T frequency in the XP_028889033 homologs, I ran the program `freak` from the EMBOSS suite with the parameters of 100 aa sliding window and a step size of 10 aa. After reformating the output, the rest of the analysis is accomplished below. -->


<!-- ```{r S_T_freq} -->
<!-- # load data -->
<!-- ST.freq <- read_tsv("output/ST-freq/ST_freq_100_10_freak.out.gz", col_types = "cid") -->
<!-- S.freq <- read_tsv("output/ST-freq/S_freq_100_10_freak.out.gz", col_types = "cid") -->
<!-- T.freq <- read_tsv("output/ST-freq/T_freq_100_10_freak.out.gz", col_types = "cid") -->
<!-- # convert sequence name column to an ordered list sorted based on the gene tree sequence -->
<!-- ST.freq <- ST.freq %>%  mutate(id = ordered(id, levels = rev(genetreeOrder))) # this will produce the desired order -->
<!-- S.freq <- S.freq %>%  mutate(id = ordered(id, levels = rev(genetreeOrder))) # this will produce the desired order -->
<!-- T.freq <- T.freq %>%  mutate(id = ordered(id, levels = rev(genetreeOrder))) # this will produce the desired order -->
<!-- ``` -->

<!-- ```{r plot_ST_freq, fig.width = 6, fig.height=6} -->
<!-- ggplot(ST.freq, aes(x = id, y = pos)) +  geom_tile(aes(fill = freq)) + -->
<!--   coord_flip() + theme_classic() + scale_fill_distiller(palette = "RdGy", limits = c(0, 0.8), oob = scales::squish, breaks = seq(0, 0.6, by = 0.2)) + -->
<!--   theme(axis.text.y = element_text(size = 5), -->
<!--         axis.line.y = element_blank(), axis.ticks.y = element_blank(), -->
<!--         axis.line.x = element_blank(),# axis.ticks.x = element_blank(), -->
<!--         legend.position = c(0.85,0.35), -->
<!--         panel.background = element_rect(fill = alpha("lightblue",0.5))) + -->
<!--   ylim(1, 4500) + labs(y = "Position in sequence", x = "Sequences", color = "Frequency") +  -->
<!--   ggtitle("Serine/Threonine frequency in 100 aa sliding windows") -->
<!-- ggsave("output/img/20201223-homologs-ST-freq-100aa-window.png", bg = "transparent", width = 7, height = 7) -->
<!-- ``` -->
<!-- ```{r, fig.width=8, fig.height=7.5} -->
<!-- ST.comb <- bind_rows(Ser = S.freq, Thr = T.freq, .id = "Var") %>%  -->
<!--   mutate(Var = ordered(Var, levels = c("Ser","Thr"))) -->
<!-- p.st <- ggplot(ST.comb, aes(x = id, y = pos)) + geom_tile(aes(fill = freq), size = 2) +  -->
<!--   facet_wrap(~Var, scales = "fixed") + theme_classic() + -->
<!--   coord_flip() +  -->
<!--   #scale_fill_distiller(palette = "RdGy", limits = c(-0.1, 0.75), oob = scales::squish, breaks = seq(0,0.7,by=0.1)) + -->
<!--   scale_fill_gradient2(low = "gray10", high = "#006000", mid = "gray90", midpoint = 0.05, breaks = seq(0,0.7,by=0.1)) + -->
<!--   #scale_fill_steps2(low = "#000000", high = "#b30000", midpoint = 0.1, breaks = seq(0,0.7,by=0.1)) + -->
<!--   #scale_fill_viridis_b(n.breaks = 10, begin = 0.1) + -->
<!--   theme(axis.text.y = element_text(size = 4), axis.title.y = element_blank(), -->
<!--         axis.line.y = element_blank(), axis.ticks.y = element_blank(), -->
<!--         axis.line.x = element_blank(),# axis.ticks.x = element_blank(), -->
<!--         strip.background = element_blank(), -->
<!--         legend.position = c(0.92,0.40), -->
<!--         panel.background = element_rect(fill = alpha("lightblue",0.5))) + -->
<!--   ylim(1, 4500) + ylab("Position in sequence") + xlab("Sequences")  -->
<!-- #plot_grid(p.gtree, p.st, ncol = 2, align = 'v', rel_widths = c(1,3), scale = c(0.99,1)) -->
<!-- p.st -->
<!-- ggsave("output/img/20201223-ST-freq-composite.png", width = 7, height = 7.5) -->
<!-- ``` -->

## Tandem repeats
The non-NTD portion of the proteins evolve rapidly and many of them contain tandem repeats. Therefore, characterizing and visualizing the type, number and spatial distribution of the tandem repeats serve to highlight the differences in the non-NTD part of the proteins in this family.

To identify and group tandem repeats, I used [XSTREAM](https://amnewmanlab.stanford.edu/xstream) with the following parameters `java -Xmx1000m -Xms1000m -jar ~/sw/XSTREAM/xstream.jar $in -i.7 -I.7 -g3 -e2 -L15 -z -B -O -Asub.txt`. The parameters were chosen to identify degenerate tandem repeats that occur at least two times and the minimum length of a tandem repeat domain (=period x copy #) must be greater than 15 a.a. Please see `script/xstream.sh` for explanation of the parameters.

```{r}
tandem <- read_tsv("../output/xstream/XSTREAM_sub_i0.7_g3_m5_L15_chart.tsv",
                   col_types = "ciiifidcccd", comment = "#")
# now let's create a tibble for plotting, which would contain each instance of the tandem repeat on a separate row
tandem.div <- tandem %>% 
  rowwise(name) %>% 
  summarize(div = list(c(seq(from = start, to = end, by = period), end)), .groups = "drop") %>% 
  unnest(div)
```

Examine the distribution of tandem repeat length as a fraction of the total protein minus the PF11765 domain
```{r}
# calculate the length of the PF11765 domain in each protein
pf11765 <- read_tsv("../data/20220623-expanded-blast-combined-homologs-PF11765-longname.BED", 
                    col_names = c("name", "start", "end"), col_types = cols())

pf11765.len <- pf11765 %>% 
  mutate(len = end - start) %>% 
  group_by(name) %>% 
  summarize(dmLen = sum(len), .groups = "drop") %>% 
  arrange(desc(dmLen))

# calculate repeat content
tr.len <- tandem %>% 
  mutate(trLength = end - start + 1) %>% 
  group_by(name) %>% 
  summarize(trLen = sum(trLength), .groups = "drop") %>% 
  right_join(select(blastInfo, name, seqLen = slen, species),
             by = "name") %>% 
  mutate(trLen = ifelse(is.na(trLen), 0, trLen)) %>% 
  full_join(pf11765.len, by = "name") %>%
  mutate(nonNTDLen = seqLen - dmLen, tr.perc = trLen/nonNTDLen) %>% 
  full_join(select(adhesin, name, frv.pred, sp.pred, gpi.pred), by = "name") %>% 
  relocate(species, .after = last_col())

# remove M. bicuspidata
tr.len <- filter(tr.len, species != "Metschnikowia bicuspidata")
```

Shorter proteins seem to have a lower content of tandem repeats as a percentrage of their non NTD portion of the protein.
```{r}
tr.len %>% 
  mutate(group = cut(seqLen, breaks = c(0,500,1000,1500,5000), 
                     labels = c("<500 aa","<1000 aa","<1500 aa",">=1500 aa"))) %>% 
  ggplot(aes(x = group, y = tr.perc)) + 
  geom_boxplot(outlier.shape = NA, width = 0.4) +
  geom_jitter(aes(color = frv.pred), width = 0.2, size = 1.5, alpha = 0.8) +
  scale_color_manual(name = "FungalRV", values = c("TRUE" = "black","FALSE" = "gold2")) +
  ylab("% of total protein (remove PF11765)") + xlab("Protein length") + #coord_flip() +
  theme_cowplot() + panel_border(color = "black")
ggsave("../output/img/20220624-tandem-repeat-proportion.png", width = 6, height = 4)
```
Test for correlations between sequence length and tandem repeat content
```{r}
# https://stackoverflow.com/questions/7549694/add-regression-line-equation-and-r2-on-graph
# # GET EQUATION AND R-SQUARED AS STRING
# # SOURCE: https://groups.google.com/forum/#!topic/ggplot2/1TgH-kG5XMA
# modified by Bin He 2021-06-28
lm_eq <- function(m){
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
         list(a = format(unname(coef(m)[1]), digits = 4),
              b = format(unname(coef(m)[2]), digits = 4),
             r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq));
}

lm.obj <- lm(seqLen ~ trLen, tr.len)
```

```{r}
tr.len %>% 
  ggplot(aes(x = trLen, y = seqLen)) + 
  geom_smooth(method = "glm", formula = y~x, se = FALSE) +
  geom_point(size = 1, alpha = 0.9) + 
  annotate(geom = "text", x = 2000, y = 500, label = lm_eq(lm.obj), parse = TRUE, size = rel(5)) +
  xlab("Tandem repeat length (aa)") + ylab("Total protein length (aa)") +
  #coord_trans(x = "sqrt", y = "sqrt") +
  #scale_color_manual(name = "FungalRV", values = c("TRUE" = "black","FALSE" = "gold2")) +
  theme_cowplot(font_size = 16) + panel_border(color = "black")
ggsave("../output/img/20220625-protein-nonNTD-length-driven-by-tandem-repeat.png", width = 5, height = 4.5)
```

> Tandem repeat content is highly correlated to the total protein length, with an estimated slope of close to 1, suggesting that protein length evolution is strongly driven by the acquirement, expansion or contraction of tandem repeats.

Summary stats of sequence length
```{r}
tr.len %>% summarize(across(c(seqLen, nonNTDLen), list(mean = mean, sd = sd, median = median)))
```

## TANGO prediction of β-aggregation prone sequences

The amyloid-like $\beta$-aggregation prone sequences have the ability to mediate self-aggregation, which boosts the local concentration of the adhesin molecules on the cell-surface. Similar to the S/T frequency above, we would like to use the output from the prediction algorithm, TANGO, to visulize the distribution of such sequence motifs along the length of the XP_028889033 homolog sequences.

TANGO predictions and the subsequent extraction of $\beta$-aggregation prone sequences accomplished by `../script/parse_tango_output.R`

How many TANGO predicted β-aggregation sequences does each homolog has, either in the entire protein or in the non-NTD portion? To do so I'll take advantage of the GRanges package to distinguish TANGO hits that are within or outside the PF11765 domain(s) for each protein. First, I need to convert the feature table into a GRanges object with the sequence names used as chromosome names and start and end as the IRanges start and end. Then I'll convert the tango output also into a GRanges object in the same way, with the exception of keeping the median as the score field.
```{r}
# read in the summary results
tango <- read_tsv("../data/tango_summary_table.tsv.gz", col_types = "cciiidddicc") %>% 
  left_join(blastInfo %>% select(id = pid, name), by = "id") %>% 
  filter(!id %in% Mb.rm$pid)

pf11765.gr <- pf11765 %>% 
  select(seqname = name, start, end) %>% 
  makeGRangesFromDataFrame(ignore.strand = TRUE)

tango.gr <- tango %>% 
  select(seqname = name, start, end, score = median) %>% 
  makeGRangesFromDataFrame(keep.extra.columns = TRUE, ignore.strand = TRUE)

# count the number of tango hits in the pf11765 domain
tango.gr.in <- subsetByOverlaps(tango.gr, pf11765.gr) %>% 
  as_tibble() %>% 
  mutate(InNTD = TRUE) %>% 
  select(name = seqnames, start, end, InNTD)

tango <- tango %>% 
  mutate(InNTD = FALSE) %>% 
  rows_update(tango.gr.in, by = c("name", "start", "end"))

# remove intermediate objects
rm(list = c("tango.gr", "tango.gr.in", "pf11765.gr"))
```

Calculate summary statistics:
```{r}
tango.sum <- tango %>% 
  group_by(name) %>% 
  summarize(n.ntd = sum(InNTD), n.ntd.gt30 = sum(round(median,0) >= 30 & InNTD),
            n.rest = sum(!InNTD), n.rest.gt30 = sum(round(median,0) >= 30 & !InNTD)) %>% 
  left_join(select(blastInfo, name, seqLen = slen), by = "name")

bind_rows(
  PF11765 = table(cut(tango.sum$n.ntd, breaks = c(0,1,3,5,10,Inf), right = FALSE)),
  `PF11765, >30` = table(cut(tango.sum$n.ntd.gt30, breaks = c(0,1,3,5,10,Inf), right = FALSE)),
  Rest = table(cut(tango.sum$n.rest, breaks = c(0,1,3,5,10,Inf), right = FALSE)),
  `Rest, >30` = table(cut(tango.sum$n.rest.gt30, breaks = c(0,1,3,5,10,Inf), right = FALSE)),
  .id = "group"
)
```

> In sum, there are more TANGO predicted β-aggregation prone sequences in the PF11765 domain than in the rest of the proteins. This is likely a result of many short proteins not having an extensive number.

Examine the distribution of the number of TANGO hits and their spatial organization within each protein. One difference from my previous analysis is that I don't exclude the hits in the PF11765 domain from the total count.
```{r}
# first count the number of out-of-NTD, median score >= 30 hits per sequence
motif.per.seq <- tango %>% 
  group_by(name) %>% 
  summarize(n.all = sum(round(median, 0) >= 30))

# next we will filter the tango dataset in order to recalculate the intervals
# this will result in 14 sequences to be dropped since they have 0 hits meeting
# the criteria. we will add them back by right_join() with the tibble above
motif.per.seq <- tango %>% 
  # limit to hits with median score >= 30
  filter(round(median,0) >= 30) %>% 
  group_by(name) %>% 
  # recalculate the interval since we are now limiting the hits to >= 30
  mutate(interval = start - lag(end) - 1) %>% 
  # summarize the results at a sequence level
  summarize(n.type = length(unique(seq)),
            n.all = n(),
            medScore = round(mean(median),1),
            IVT = round(mean(ivt),2),
            avg.intv = round(mean(interval, na.rm = T),1), 
            IQR.intv = round(IQR(interval, na.rm = T)/1.349 ,1),
            # median absolute deviation is a robust measure of the scale parameter
            # https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/mad
            mad.intv = round(mad(interval, na.rm = T),1),
            seqs = paste(unique(seq), collapse = ","),
            .groups = "drop_last") %>% 
  right_join(motif.per.seq, by = c("name", "n.all")) %>% 
  arrange(desc(n.all), desc(mad.intv))
DT::datatable(motif.per.seq)

# add extra information for plotting below
motif.per.seq <- motif.per.seq %>% 
  left_join(select(blastInfo, species, name), by = "name") %>% 
  mutate(mad.intv = ifelse(n.all > 2, mad.intv, NA),
         reg.spaced = ifelse(n.all >= 5 & mad.intv < 5, TRUE, FALSE),
         species = ordered(species, levels = get_taxa_name(p.tree)))
```

<!-- Export the tango motif per seq result -->
<!-- ```{r} -->
<!-- motif.per.seq1 %>%  -->
<!--   mutate(id = gsub("_[a-zA-Z]+$", "", name), -->
<!--          species = gsub(".*_([A-Z])([a-z]+)$", "\\1\\. \\2", name)) %>%  -->
<!--   select(id, species, species_gr, n.type, n.all, avg.intv, mad.intv, seqs) %>%  -->
<!--   write_tsv("output/tango/20210904-tango-summary-table.tsv") -->
<!-- ``` -->

```{r}
#  mutate(Clade2 = ifelse(species %in% clavispora, "Clavispora",
#                         ifelse(species %in% candida, "Candida", 
#                                ifelse(species %in% saccharo, "Saccharomycetaceae", "other"))))
p0 <- motif.per.seq %>% 
  ggplot(aes(x = species)) + geom_bar() + coord_flip() +
  # thanks to https://stackoverflow.com/questions/10834382/ggplot2-keep-unused-levels-barplot
  scale_x_discrete(drop = FALSE) +
  scale_y_continuous(minor_breaks = NULL) + 
  ylab("Hil family size")

col.regspc <- c("TRUE" = "#af8400", "FALSE" = alpha("grey30", 0.5))
p1 <- motif.per.seq %>% 
  ggplot(aes(x = species, y = n.all)) + 
  geom_jitter(aes(color = reg.spaced, size = reg.spaced), width = 0.2, height = 0) +
  scale_x_discrete(drop = FALSE) +
  scale_size_manual(values = c(.8,1.8), guide = "none") +
  scale_color_manual(name = "Regularly spaced", values = col.regspc) +
  coord_flip() + ylab("# TANGO hits/protein")# +
  #scale_y_continuous(trans = "pseudo_log", breaks = c(0,1,2,5,10,20,40),
  #                   minor_breaks = NULL)
legend <- get_legend(p1 + guides(color = guide_legend(nrow = 1), pch = guide_legend(nrow = 1)) +
                       theme(legend.position = "bottom"))
  
p2 <- motif.per.seq %>% 
  mutate(mad.intv = ifelse(is.na(mad.intv), 500, mad.intv)) %>% 
  ggplot(aes(x = species, y = mad.intv)) + 
  geom_jitter(aes(color = reg.spaced), size = 2.2, width = 0.3, height = 0) +
  #geom_dotplot(aes(fill = reg.spaced), binaxis = "y", binwidth = 0.5, 
  #             stackdir = "center", width = 0.3, binpositions = "all") +
  #scale_shape_manual(name = "Regularly spaced", values = c(21, 19)) +
  scale_color_manual(name = "Regularly spaced", values = col.regspc) +
  scale_y_continuous(trans = "pseudo_log", 
                     breaks = c(0,5,25,100,500), minor_breaks = NULL) +
  coord_flip() + ylab("Variation in spacing (MAD)")

p3 <- theme_bw(base_size = 14) + theme(axis.text.y = element_blank(), axis.title.y = element_blank(), legend.position = "none")
prow <- plot_grid(p0 + p3, p1 + p3, ncol = 2, rel_widths = c(1.5,2))

plot_grid(legend, prow, ncol = 1, rel_heights = c(.1,1))
#plot_grid(p.tree, p0+p3, p1 + p3, nrow = 1, align = "h", rel_widths = c(4,1,2), scale = c(1,0.9,0.9))

ggsave("../output/img/20220915-tango-hits-number-and-interval.png", width =3.5, height = 4.5)
```

we need a species tree on the side
```{r}
p.sps.tree.side <- ggtree(sps.tree, ladderize = FALSE) + xlim(0,2.4) + scale_y_reverse() +
  geom_tiplab(aes(label = paste(str_sub(word(label, 1), 1, 1), word(label, 2), sep = ". ")),
              size = 3.2, fontface = "italic", align = TRUE, linesize = 0.1, offset = 0.05) +
  #geom_treescale(x = 0, width = 0.2, linesize = 1.2) +
  geom_hilight(node = clade["MDR"], fill = "#7F00FF", alpha = 0.15)  + # MDR
  geom_hilight(node = clade["CaLo"], fill = "pink", alpha = 0.25)    + # Candida/Lodderomyces
  geom_hilight(node = clade["glabrata"], fill = "steelblue", alpha = 0.15)  + # glabrata
  #geom_cladelabel(node = clade["MDR"],  label = "MDR", offset = 1.5,# color = "purple",
  #                offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# MDR
  #geom_cladelabel(node = clade["CaLo"],  label = "Candida/\nLodderomyces", offset = 1.47,# color = "hotpink2",
  #                offset.text = 0.1, angle = 270, hjust = .5, extend = 0.5, fontsize = 3.5) + # albicans
  #geom_cladelabel(node = clade["glabrata"],  label = "glabrata", offset = 1.38,# color = "steelblue", 
  #                offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# glabrata
  geom_tippoint(aes(color = pathogen)) +
  scale_color_manual(values =  c("crustacean" = "#6a5acd",
                                 "human" = "#d14949", 
                                 "human (rare)" = "steelblue",
                                 "no report" = "gray20"),
                     guide = "none")
p.sps.tree.side
ggsave(filename = "../output/img/20220916-species-tree-side.png", width = 3.5, height = 5)
```


# Feature map for homologs
The goal is to produce a schematic plot for each homolog outlining their main features, such as the locations of the PFam domains (mainly the Hyp_reg_CWP), locations of the signal peptide and GPI-anchor, distribution of TANGO sequences. Note that all these features can be represented as a range with associated metadata. So the first step is to collect the coordinates of the features

## Gene tree for organizing the features
```{r}
# load the gene tree
gene.tree <- read.nhx(file = "../data/20220512-generax-clustalo-shen2018-wScer-gene-tree.nhx") %>% 
  drop.tip(tip = Mb.rm$name)
# add supplemental information
clades <- sps.tree %>% as_tibble() %>% select(treeName, group) %>% na.omit()
gene.tree <- left_join(gene.tree, select(spsInfo, treeName, family), by = c("S" = "treeName")) %>% 
  left_join(clades, by = c("S" = "treeName"))# %>% 
  #mutate(family = forcats::fct_relevel(family, "Metschnikowiaceae", after = Inf))
# label selected species to show the clustering
selected_nodes <- gene.tree %>% as_tibble() %>% 
  filter(S %in% c("Candida_albicans", "Candida_glabrata", "Candida_auris")) %>% 
  pull(node)
# color by family
clade.cols <- c(
  "CaLo" = "firebrick",
  "MDR" = "#7F00FF",
  "glabrata" = "steelblue"
)
```

```{r fig.width=5, fig.height=7}
p.gene.tree <- ggtree(gene.tree, size = 0.4) + xlim(0,4) + 
  #geom_nodepoint(aes(fill = D), data = td_filter(D == "Y"), shape = 21, size = 1, color = "red") + 
  geom_tippoint(aes(color = group), size = 1) +
  geom_tiplab(aes(color = family),
               align = TRUE, linesize = 0.1, size = 1, offset = 0.05) +
  geom_nodelab(aes(x = branch, label = node), size = 1) +
  #geom_cladelab(node = 357,  label = "M. bicuspidata", offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +
  #geom_text(label = "D", data = td_filter(D == "Y"), hjust = -0.4, size = 1.5, color = "red")# +

  scale_color_manual(name = "Clade", values = clade.cols)# +
p.gene.tree
ggsave(filename = "../output/img/20220916-gene-tree-side.png", width = 5, height = 7)
```

Extract gene tree order
```{r}
genetreeOrder <- get_taxa_name(p.gene.tree)
```

## Organize and combine the tandem repeats features
```{r}
# summarize stats of tandem repeats
repeats <- tandem %>% 
  group_by(type, period) %>% 
  summarize(n = n(), copyMean = mean(copyN), .groups = "drop") %>% 
  mutate(length = period * copyMean)

tr <- tandem %>% 
  left_join(select(repeats, type, copyMean), by = c("type" = "type")) %>% 
  filter(name %in% genetreeOrder) %>% 
  mutate(type = paste0("TR-", type),
         tip = paste0(consensus_nogap,
                      "\ntype: ", type,
                      "\nperiod: ", period, 
                      "\ncopyN: ", copyMean),
         name = ordered(name, levels = rev(genetreeOrder))) %>% 
  select(name, type, start, end, tip)
```

Make a `seqLen` object to draw the overall length of each protein
```{r}
seqLen <- blastInfo %>% 
  filter(name %in% genetreeOrder) %>% 
  mutate(start = 1, 
         name = ordered(name, levels = rev(genetreeOrder))) %>% 
  select(name, start, end = slen)
```

```{r}
# GPI-anchor
# use pred.gpi
# feature set
# structure: id  feature  start  end
feature <- bind_rows(
  Hyphal_reg_CWP = pf11765,
  # extend the signal peptide segment by 10 amino acids to make it more visible
  `Signal Peptide` = signalp6 %>% mutate(end = end + 10) %>% select(name, start, end),
  # extend the GPI-anchor C-terminus segment by 20 amino acids to make it more visible
  `GPI-anchor` = pred.gpi %>% filter(gpi.pred) %>%
    left_join(select(blastInfo, name, slen), by = "name") %>% 
    mutate(start = cleavePos-10, end = slen) %>% 
    select(name, start, end),
  `Tandem Repeats` = tr %>% select(name, start, end),
  .id = "type"
) %>% filter(!name %in% Mb.rm$name)

feature <- feature %>% 
  mutate(name = ordered(name, levels = rev(genetreeOrder)),
         type = ordered(type, levels = c("Hyphal_reg_CWP", "Signal Peptide", "GPI-anchor", "Tandem Repeats")))
feature.colors <- c("Hyphal_reg_CWP" = "#3d85c6", "Signal Peptide" = "#cc0000", "GPI-anchor" = "#6a3d9a", "Tandem Repeats" = "#af8400bb")
```

## Plot domain organization
```{r plot_tandem_repeats, warning=FALSE}
h = 1.2
# plot
p0 <- ggplot(seqLen, aes(x = name, y = start, xend = name, yend = end)) + 
  geom_segment(color = "gray80", size = h)
p1 <- geom_segment(data = feature,  aes(color = type), size = h)
p2 <- geom_segment(data = tandem.div, aes(x = name, xend = name, y = div, yend = div + 1.5),
                   size = h, color = "white")
p3 <- p0 + p1 + coord_flip() + theme_classic() + 
  scale_color_manual(values = feature.colors) +
  scale_y_continuous(expand = expansion(mult = c(0.02, 0.02))) +
  theme(axis.text.y = element_text(size = 3), axis.title.y = element_blank(),
        axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
        legend.position = c(0.8, 0.9), 
        legend.text = element_text(size = 12), legend.title = element_text(size = 14),
        plot.title = element_text(hjust = 0.5), 
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  labs(y = "Position in sequence", x = "Sequences", color = "FEATURES")
#plot_grid(p.gtree, p3, ncol = 2, align = 'v', rel_widths = c(1,2), scale = c(1.01,1))
ggsave("../output/img/20220916-domain-tandem-repeats.png", plot = p3, width = 6, height = 7.5)
```
**Fig. ?** Blue boxes indicate the PF11765 domains while all other non-grey boxes indicate XSTREAM-determined tandem repeat domains. Colors are used to group highly similar tandem repeats. The black thin lines demarcate adjacent tandem repeat units. The table below shows the copy number, period and consensus sequence for each tandem domain organized by the host sequences.

## Separate TR types

```{r}
#DT::datatable(
#  tandem %>% 
#    dplyr::rename(seqL = seqLength, err = consensus_error, seq = consensus_nogap) %>% 
#    select(-seqAlign, -type, -consensus_gap, -seq, seq) %>% 
#    arrange(desc(name)),
#  fillContainer = FALSE, options = list(pageLength = 10)
#)
```

Repeat the above analysis but distinguishing between all different TR types
```{r set_tr_color}
require(RColorBrewer)
tr.th <- 100 # arbitrary threshold for distinguishing "short" from "long" TRs
tr.col <- character(nrow(repeats)) # create a color vector
short.rp <- which(repeats$length < tr.th) # identify the short repeats indices
long.rp <- setdiff(1:nrow(repeats), short.rp) # the long repeats indices
set.seed(123) # for reproducibly shuffling the order before assigning the colors
short.rp <- sample(short.rp) # shuffle the indices for short tandem repeats
tr.col[short.rp] <- colorRampPalette(brewer.pal(12, "Paired")[seq(3,11,by=2)])(length(short.rp)) # assign the short repeats a lower contrast color
set.seed(231) # for reproducibly shuffling the order before assigning the colors
long.rp <- sample(long.rp)
tr.col[long.rp] <- colorRampPalette(brewer.pal(12, "Paired")[seq(4,12,by=2)])(length(long.rp)) # assign the long repeats a higher contrast color
# -- desaturate the colors -- 
# https://stackoverflow.com/questions/26314701/r-reducing-colour-saturation-of-a-colour-palette
library(colorspace)   ## hsv colorspace manipulations

## Function for desaturating colors by specified proportion
desat <- function(cols, sat=0.5) {
    X <- diag(c(1, sat, 1)) %*% rgb2hsv(col2rgb(cols))
    hsv(X[1,], X[2,], X[3,])
}

tr.col <- desat(tr.col, sat = 0.8)
# -- finish --
tr.col <- paste0(tr.col, "CC") # add 20% transparency to the TR features
names(tr.col) <- paste0("TR-", repeats$type) # name the colors by the TR types
repeats$color <- tr.col
```

Combine domains, SP and GPI-anchor with TR features.
```{r} 
# combine sequence features with tandem repeats
feature1 <- bind_rows(filter(feature, type != "Tandem Repeats"), tr) %>% 
  mutate(type = ordered(type, levels = c("Hyphal_reg_CWP", "SignalP", "GPI-anchor", unique(tr$type))))
feature.col1 <- c(feature.colors[1:3], tr.col) # remove the singel "Tandem Repeats" type
#write_tsv(feature1, file = "../output/misc/R-feature-table.tsv", col_names = TRUE)
```

```{r, warning=FALSE}
# plot
p1 <- geom_segment(data = feature1, aes(color = type, text = tip), size = h)
p2 <- geom_segment(data = tandem.div, aes(x = name, xend = name, y = div, yend = div + 1.5),
                   size = h, color = "white")
p3 <- p0 + p1 + coord_flip() + theme_classic() + 
  scale_color_manual(values = feature.col1) +
  scale_y_continuous(expand = expansion(mult = c(0.02, 0.02))) +
  theme(axis.text.y = element_text(size = 3), axis.title.y = element_blank(),
        axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
        legend.position = "none", plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  labs(y = "Position in sequence", x = "Sequences", color = "FEATURES")
# plot_grid(p.gtree, p3 + p2, ncol = 2, align = 'v', rel_widths = c(1,2), scale = c(1.01,1))
ggsave("../output/img/20220916-domain-tandem-repeats-distinct-color.png", plot = p3, width = 5, height = 7.5)
```

```{r, warning=FALSE, fig.height=9, fig.width=9}
# plot
#require(plotly)
plotly::ggplotly(p3, tooltip = "text", width = 900, height = 900)
```

## Tango predicted β-aggregation sequences
```{r tango_sequences, fig.width=6, fig.height=8}
# reorder the sequences for plotting
# plot
#p1 <- ggplot(seqLen, aes(x = name, y = start, xend = name, yend = end)) + geom_segment(color = "gray80", size = 2)
p1 <- geom_segment(data = filter(pf11765, !name %in% Mb.rm$name), 
                   aes(x = name, xend = name, y = start+1, yend = end), size = h, color = "#3d84c6")
p2 <- geom_segment(data = filter(tango, !name %in% Mb.rm$name), 
                   aes(x = name, xend = name, y = ifelse(start-4 >= 0, start-4, 0), yend = end + 4, color = median), size = h)
p3 <- p0 + p1 + p2 + coord_flip() + theme_classic() + 
  scale_color_viridis_c(limits = c(5,101), breaks = c(5,seq(25,100,25)), direction = -1) +
  theme(axis.text.y = element_text(size = 4), axis.title.y = element_blank(),
        axis.line.y = element_blank(), axis.ticks.y = element_blank(),
        axis.line.x = element_blank(), axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(0.8,0.88),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        panel.background = element_rect(fill = alpha("lightblue",0.1))) +
  ylim(-2, 4500) + labs(y = "Position in sequence", x = "Sequences", color = "TANGO score")
p3
##plot_grid(p.gtree, p4, ncol = 2, align = 'v', rel_widths = c(1,2.5), scale = c(1.01,1))
ggsave("../output/img/20220916-tango-score-segment.png", plot = p3, width = 6, height = 7.5)
```

To compare the number of TANGO motifs in _C. auris_ Hil1-4 and their close homologs to the rest of the Hil family, we first identify the first group of sequences
```{r}
a <- which(genetreeOrder == "XP_025344416.1_Candida_haemuloni", arr.ind = TRUE)
b <- which(genetreeOrder == "XP_024716365.1_Candida_pseudohaemulonii", arr.ind = TRUE)
hil1_4_mdr <- genetreeOrder[a:b]
rm(list = c("a", "b"))
```

```{r}
motif.per.seq %>% 
  select(name, n.all) %>% 
  mutate(group = ifelse(name %in% hil1_4_mdr, "Hil1-4_MDR", "others")) %>% 
  group_by(group) %>% 
  summarize(median = median(n.all))
```


_**Discussion**_

- The one sequence below 500 a.a. is from _N. delphensis_, which is labeled as a partial CDS.
- Majority of the proteins in the list are 500-2000 a.a., with a few exceptionally long
- Not only do Saccharomycetaceae species have fewer Hil family homologs, the ones they have are also short (< 1000 a.a.) with the exception of _C. glabrata_



# Chromosomal locations

## Are members of this protein family enriched in the subtelomeric regions?
A recent long-read sequencing study for _C. glabrata_ annotated 31 novel ORFs, of which 24 are GPI-Cell Wall Proteins. The authors cited previous literature supporting the  in the subtelomeric regions [Xu 2020](https://doi.org/10.1111/mmi.14488). While performing BLAST on FungiDB to identify homologs of our protein, I also noticed that many of the hits appear to be at the beginning and end of the chromosomes.

To test if there is indeed a significant enrichment among this family of proteins in the subtelomeric regions, I collected the chromosomal locations for a subset of the species whose genomes were assembled to a chromosomal or complete genome level (the two are somewhat equivalent). These include _Candida albicans, Candida auris, Candida dubliniensis, Candida orthopsilosis, Debaryomyces hansenii, Kazachstania africana, Kazachstania naganishii, Kluyveromyces lactis, Naumovozyma castellii, Naumovozyma dairenensis, Candida glabrata, Scheffersomyces stipitis_

To formally test this hypothesis, I need to account for the background gene density differences along the chromosomes. The idea is to compare the chromosomal positions for this group of proteins compared with the gene densities on the chromosomes they reside on.

One consideration for this analysis is that if we simply use all the species for which a well assembled genome is available, we may end up over-sampling a phylogenetic clade and having its Hil family's localization over-weighted in the final test. To avoid giving too much weight on a group of species, I will select a representative species if more than one is available in a relatively closely related clade. For example, we will use _Candida albicans_ and remove _Candida dubliniensis_.

```{r decide_which_species_to_use}
# import the chromosomal location information
chrLoc <- read_tsv("../data/20220919-expanded-blast-chromosomal-locations.tsv", col_types = cols())
# decide the species to be used for the analysis
use.sps <- c("Candida albicans", "Debaryomyces hansenii", "Candida orthopsilosis",
             "Kazachstania africana", "Kluyveromyces lactis",
             "Naumovozyma dairenensis", "Candida auris", "Candida glabrata")
# create a new tibble for our homologs from these species
fg.freq <- chrLoc %>% filter(species %in% use.sps)
# import assembly info
assembly.info <- read_tsv("../data/20220525-expanded-blast-species-assembly-info.tsv",
                          col_types = cols())
use.sps <- assembly.info %>% 
  filter(species %in% use.sps) %>%
  filter(!species %in% c("Candida glabrata", "Candida auris", "Kluyveromyces lactis")) %>% 
  select(species, assembly = assemblyaccession) %>% 
  # manually add several species. note that we are using the B11245 (Clade 4) strain to 
  # represent C. auris, because it is assembled to a chromosome level and has annotation files
  add_row(species = c("Candida glabrata", "Candida auris", "Kluyveromyces lactis"),
          assembly = c("GCA_010111755.1", "GCA_008275145.1", "GCF_000002515.2"))
```

To conduct this test, we first need to prepare and compute the background gene densities. To do this, we will gather the genome assembly files for the selected species, read them into R, and generate a table that contains one row for each gene, with its gene ID, chromosome number, chromosome length and the start position expressed as a percentage measured from the chromosome ends.

```{r compute_background_freq}
# 1. prepare file names
#   get all file names that ends with "feature_table.txt.gz", which contain the gene annotation
feature.files <- list.files(path = "../data/assembly-info/", pattern = "*feature_table.txt.gz$")
names(feature.files) <- sapply(str_split(feature.files, pattern = "_"), function(x) {
  paste(x[1], x[2], sep = "_")
})
# get all file names that ends with "assembly_report.txt", which contain the chromosomal length
assembly.files <- list.files(path = "../data/assembly-info/", pattern = "*assembly_report.txt$")
names(assembly.files) <- sapply(str_split(assembly.files, pattern = "_"), function(x) {
  paste(x[1], x[2], sep = "_")
})
use.sps$FeatureFile <- feature.files[use.sps$assembly]
use.sps$AssemblyFile <- assembly.files[use.sps$assembly]
```

```{r}
# 2. read in the assembly information
feature.col.names <- c("feature","class","assembly","assembly_unit","seq_type","chromosome","genomic_accession","start","end","strand","product_accession","non_redundant_refseq","related_accession","name","symbol","GeneID","locus_tag","feature_interval_length","product_length","attributes")
assembly.col.names <- c("chromosome","seq_role","assign_molecule","type","gb_acc","relationship",
                        "refseq_acc","assembly_unit","seqL","ucsc_name")
compute.bg.freq <- function(row){
  assembly.file <- row["AssemblyFile"]
  assembly <- read_tsv(paste0("../data/assembly-info/",assembly.file), comment = "#",
                       col_names = assembly.col.names, col_types = "ccccccccic")
  feature.file <- row["FeatureFile"]
  feature <- read_tsv(paste0("../data/assembly-info/",feature.file), col_names = feature.col.names,
                      col_types = "ccccccciicccccccciic", skip = 1)
  res <- feature %>% 
    filter(feature == "mRNA") %>% 
    # these feature tables are organized hierarchically, with the top level being "gene"
    # the next level one of "mRNA", "ncRNA", "tRNA" or "rRNA". we only count protein-coding genes, i.e.
    # "mRNA". the reason I didn't select the "CDS" feature type is because in a small number of cases,
    # one mRNA feature contains more than one CDS feature, possibly due to splicing or alternative 
    # translational start site
    select(chromosome, start, end) %>% 
    left_join(assembly %>%
                mutate(chraccver = ifelse(refseq_acc != "na", refseq_acc, gb_acc)) %>% 
                select(chromosome, chraccver, seqL), 
              by = c("chromosome" = "chromosome")) %>%
    mutate(relLoc = round(start / seqL, 3))
  return(res)
}
```

```{r}
# 3. apply the function to the genomes
bg.freq <- apply(use.sps, MARGIN = 1, compute.bg.freq)
names(bg.freq) <- use.sps$species
```

Let's take a look at the gene density in one of the genomes, e.g. _D. hansenii_

Below I use the `geom_histogram` function instead, with breaks specified manually. This results in a density distribution that is pretty flat across the chromosomes.
```{r}
bg.freq$`Debaryomyces hansenii` %>% ggplot(aes(x = relLoc)) + 
  geom_histogram(breaks = seq(0,1,0.05)) +
  facet_wrap(~ chromosome) + ggtitle("D. hansenii") + theme(plot.title = element_text(hjust = 0.5))
```

In order to compare the distribution of _all_ genes in different bins of the chromosomes to the homologs in our case study, we can divide each chromosome in the `nrow(use.sps)` genomes into an arbitrary number of bins after "folding" them in half, e.g. 0-10%, 10-20%, 20-30%, 30-40% and 40-50%. To be able to visually compare the distribution of our homologs and the genome background, we will create a special "chromosome" class that will be our homologs and combine them with the `bg.freq` table.
```{r}
freq.bins <- c(-0.001, seq(0.1, 0.5, 0.1)); freq.binsL <- c(0, freq.bins[-1])
freq.label <- paste0(head(freq.binsL, -1)*100,"-",tail(freq.binsL, -1)*100,"%")
bg.freq.tb <- bind_rows(bg.freq, .id = "species") %>% 
  mutate(fold.relLoc = ifelse(relLoc <= 0.5, relLoc, 1-relLoc),
         bin = cut(fold.relLoc, breaks = freq.bins, labels = freq.label))

# add the homologs
fg.freq <- fg.freq %>% 
  mutate(fold.relLoc = ifelse(relLoc <= 0.5, relLoc, 1-relLoc),
         bin = cut(fold.relLoc, breaks = freq.bins, labels = freq.label)) 
freq.plot <- fg.freq %>%
  mutate(chromosome = "Hil") %>%  # we label the homologs as Hil to make it a separate class
  select(species = species, chromosome, bin) %>% 
  bind_rows(select(bg.freq.tb, species, chromosome, bin)) %>% 
  mutate(chromosome = ordered(chromosome, levels = c(1:12,LETTERS,"Hil"))) %>% 
  group_by(species) %>% 
  filter(sum(chromosome == "Hil") >=3) # remove species that have fewer than 3 Hil homologs
```

```{r}
# plot
freq.plot %>% 
  #mutate(species = paste0(substr(species, 1, 1), ". ", substr(species, 2, 15))) %>% 
  ggplot(aes(x = chromosome, group = bin, fill = bin)) + 
  geom_bar(position = position_fill()) +
  facet_wrap(~ species, scales = "free_x", ) + 
  #scale_fill_brewer("Distance from\nchromosome end", type = "qual", palette = 3) +
  scale_fill_viridis_d(direction = -1, end = 0.95, alpha = 0.9) +
  scale_y_continuous(name = "Cumulative % of genes", trans = "reverse", breaks = seq(0,1,0.2)) + 
  theme_cowplot() + panel_border(color = "grey80") +
  theme(legend.position = "none", strip.background = element_blank(),
        strip.text.x = element_text(face = 3),
        axis.text = element_text(size = rel(0.7)))

ggsave("../output/img/20220920-compare-homologs-chromosomal-locations-to-bg.png", width = 7, height = 4)
```
In the plot above, we can see that the distribution of the Hil proteins on the chromosome deviate from the background. Only species with three or more Hil family members are shown here.

Now that we have the background gene frequencies computed, we can start constructing a test that tests for significant departure in the chromosomal locations of our protein family from the background frequencies. One idea is to divide the chromosome into equal-sized bins and use the frequencies of _all_ genes in each bin as the multinomial probabilities in the null hypothesis. If our proteins were randomly selected from each chromosome without regard to their location, we would expect their locations to conform to the background frequencies. This can be tested using an exact multinomial test or an approximate Chi-square test or G-test. Since we don't distinguish the two ends of a chromosome, we can "fold" the chromosome "in half" and measure each gene's location as a percentage from the closer end, i.e. the relative location ranging from 0%-50%. If we can reject the null hypothesis, that constitutes evidence that the proteins from our group are not randomly selected from the background set.

**Result of multinomial test**

```{r multinomial_test}
# install XNomial package for the multinomial exact test function
# https://cran.r-project.org/web/packages/XNomial/vignettes/XNomial.html#e1
while(!require(XNomial))
  suppressMessages(install.packages("XNomial"))
# calculate pooled background frequency
bg.cnt <- tabulate(bg.freq.tb$bin)
fg.cnt <- tabulate(fg.freq$bin)
xmulti(obs = fg.cnt, expr = bg.cnt, detail = 3)
```
Note that the three _P_-values correspond to three approaches of testing the goodness-of-fit. The LLR, which stands for Log Likelihood Ratio, is generally preferred. But all three said the same thing: the observation deviates from the background frequency significantly. By looking at the plots above, it is clear that the deviation is due to the excess of our homologs residing in the 0-10% bin, which is the tip of the chromosomes.

The test above assumes that the gene density along all chromosomes in the eight species follow the same distribution. The empirical observation supports that hypothesis. Should that to be not the case, we could also account for the variability in gene density distribution between species or even between chromosomes within a species. The idea is to first collect the chromosomes that our homologs come from, and then randomly sample the **same** number of genes as the number of our homologs on that chromosome. Do this for all of the homolog-containing chromosomes, we would obtain one random sample. Repeat this process 1000 times or more, we will then get the pseudosample, which we can use to compare with our observations. Here we need to come up with a statistic that summarizes each of our observation or pseudosample. For example, we could calculate the median % location for each sample, and ask where our observation lie relative to the pseudosamples. I'll skip this approach for now since the first appraoch appears to be OK given the similar gene density distribution across chromosomes and species.